[KV Offload] Refactor CPU offloading: pluggable CachePolicy, remove Backend abstraction, restructure into cpu/ package #37874
Conversation
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Code Review
This pull request is an excellent refactoring that consolidates the LRUOffloadingManager and ARCOffloadingManager into a single CPUOffloadingManager using a strategy pattern with pluggable CachePolicy implementations. This significantly reduces code duplication and improves modularity. A key improvement is making the eviction logic atomic, preventing partial evictions on failure. The implementation of the strategy pattern, including the CachePolicy ABC and the policy registry, is clean and extensible. The changes are well-tested. I have one suggestion to improve the robustness of the LRUCachePolicy.
vllm/v1/kv_offload/cpu_manager.py
Outdated
```python
self.blocks[block_hash] = block

def remove(self, block_hash: BlockHash) -> None:
    del self.blocks[block_hash]
```
The remove method in LRUCachePolicy uses del self.blocks[block_hash], which will raise a KeyError if the block hash is not present. While the current call site in CPUOffloadingManager.complete_store ensures the key exists before calling remove, making this method more robust would be beneficial for future maintenance and to prevent potential crashes if the calling logic changes. The ARCCachePolicy implementation already uses a safer pop method. Consider using self.blocks.pop(block_hash, None) to silently handle cases where the block is not in the cache, which improves defensiveness.
```diff
-    del self.blocks[block_hash]
+    self.blocks.pop(block_hash, None)
```
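The behavioral difference the reviewer is pointing at can be shown in a standalone snippet; the `blocks` dict here merely stands in for the policy's internal mapping and is not the actual vLLM code:

```python
# Illustrative only: `blocks` stands in for LRUCachePolicy's internal dict.
blocks = {"hash-a": "block-a"}

# `del` raises KeyError when the key is absent:
try:
    del blocks["hash-missing"]
except KeyError:
    print("del raised KeyError")

# `pop` with a default is silent for absent keys:
blocks.pop("hash-missing", None)  # no exception
blocks.pop("hash-a", None)        # removes the entry
print(blocks)  # -> {}
```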
vllm/v1/kv_offload/cpu.py
Outdated
```python
)
self._manager = CPUOffloadingManager(
    backend=backend,
    cache_policy=cast(Literal["lru", "arc"], self.eviction_policy),
```
IMO since `self.eviction_policy` comes from user config (line 59), wouldn't it be cleaner to pass the raw string with `# type: ignore[arg-type]` and let `CPUOffloadingManager.__init__` handle the validation? It already raises `ValueError` for unknown policies, and `cast()` silences mypy without validating anything at runtime.
Thanks, good point. I'll replace `cast()` with `# type: ignore[arg-type]` and rely on the runtime validation in `CPUOffloadingManager.__init__`.
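A minimal sketch of the agreed-upon pattern: the constructor validates against the registry at runtime, so the call site can pass the raw config string. The registry name `_CACHE_POLICIES` follows the PR description, but the class bodies and signatures here are simplified assumptions:

```python
# Sketch only: registry-backed runtime validation for the policy name.
class LRUCachePolicy: ...
class ARCCachePolicy: ...

_CACHE_POLICIES = {"lru": LRUCachePolicy, "arc": ARCCachePolicy}

class CPUOffloadingManager:
    def __init__(self, cache_policy: str) -> None:
        # Validate here, once, instead of cast()-ing at every call site.
        if cache_policy not in _CACHE_POLICIES:
            raise ValueError(
                f"Unknown cache policy {cache_policy!r}; "
                f"expected one of {sorted(_CACHE_POLICIES)}"
            )
        self.cache_policy = _CACHE_POLICIES[cache_policy]()

# At the call site the raw user-config string passes through; mypy's
# Literal complaint is silenced while validation stays at runtime:
#   CPUOffloadingManager(cache_policy=self.eviction_policy)  # type: ignore[arg-type]
```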
orozery
left a comment
I would suggest the following file structure:
vllm/v1/kv_offload/cpu/manager.py
vllm/v1/kv_offload/cpu/spec.py
vllm/v1/kv_offload/cpu/policies/abstract.py
vllm/v1/kv_offload/cpu/policies/lru.py
vllm/v1/kv_offload/cpu/policies/arc.py
@ronensc WDYT?
vllm/v1/kv_offload/cpu_manager.py
Outdated
```python
def __init__(
    self,
    backend: Backend,
```
We should remove the Backend abstraction and merge the code of CPUBackend inside CPUOffloadingManager.
Thanks, this makes sense. I'll merge CPUBackend into CPUOffloadingManager.
I also like the proposed file structure. A couple of questions:
- Should the `worker/` directory also be moved under `cpu/`?
- Do you prefer handling the file structure reorganization in this PR, or as a follow-up PR?
> Should the `worker/` directory also be moved under `cpu/`?
We need to decide if we want to split directories based on worker/scheduler.
Let's think about that later.
> Do you prefer handling the file structure reorganization in this PR, or as a follow-up PR?
Let's do this here.
Done.
@orozery ready for a 2nd review round.
Updated PR title and description
Purpose
Refactor the CPU KV-cache offloading subsystem to reduce duplication, remove
unnecessary abstraction layers, and improve code organization.
Consolidate LRU/ARC managers using a strategy pattern

`arc_manager.py` and `lru_manager.py` duplicated ~40 lines of identical skeleton code: `take_events`, event emission, ref-count management in `prepare_load`/`complete_load`, `backend.allocate_blocks()`, and `__init__` boilerplate. These are now unified in `CPUOffloadingManager`, with policy-specific logic isolated in `CachePolicy` implementations.

Remove the `Backend` abstraction

The `Backend` ABC and `CPUBackend` class added a layer of indirection with no polymorphism benefit. The block pool logic is now inlined directly into `CPUOffloadingManager` as private methods.

Restructure into a `cpu/` package

Per reviewer suggestion, the flat files are split into a proper subdirectory with one responsibility per file:

- `vllm/v1/kv_offload/cpu/manager.py`
- `vllm/v1/kv_offload/cpu/spec.py`
- `vllm/v1/kv_offload/cpu/policies/abstract.py`
- `vllm/v1/kv_offload/cpu/policies/lru.py`
- `vllm/v1/kv_offload/cpu/policies/arc.py`
Key design decisions

- The `CachePolicy` abstract base covers both block organization and replacement decisions; LRU and ARC differ in both, so they cannot be cleanly split.
- Policies live in a `_CACHE_POLICIES` registry dict, so adding a new policy requires no changes to `CPUOffloadingManager`.
- `evict()` implementations are now atomic: candidates are collected without mutating state, and changes only apply if all `n` evictions can be satisfied (the original ARC code could partially evict before returning `None`).
- `LRUCachePolicy.touch()` fixed: `if self.blocks.get(hash):` → `if hash in self.blocks:` (the old check was unreliable for blocks with `ref_cnt == 0`).

Files deleted: `arc_manager.py`, `lru_manager.py`, `cpu.py`, `cpu_manager.py`, `backend.py`, `backends/cpu.py`

Test Plan
Test Result
cc @orozery @albertoperdomo2